Abstract — A new flip-flop design using a double-pulsed static latch is 
presented. The flip-flop has only a single stage of logic in the critical path 
and as a result is up to three times faster than the fastest previously known 
flip-flops, while consuming approximately the same energy as the lowest- 
power flip-flops. The flip-flop has asymmetric timing properties which 
make it a good match to skewed logic styles. A novel dual-pulse genera- 
tor further reduces power requirements. 

Index Terms — flip-flop, pulsed latch 



I. Introduction 

Flip-flops are critical timing elements in digital circuits and 
have a large impact on circuit speed and power consumption. 
Consequently, extensive research has been performed to de- 
velop fast and low-power flip-flops [1], [2], [3], [4]. The pri- 
mary measure of performance of a flip-flop is the minimum D- 
to-Q delay [3], as this determines how much impact the flip- 
flop has on cycle time. Recently, pulsed latch structures have 
emerged as the fastest known flip-flop structures [1], [2]. By 
reducing the transparency period of a latch to a narrow window, 
the latch can operate as a flip-flop with the additional advantage 
of allowing limited time-borrowing across cycle boundaries to 
reduce sensitivity to clock skew and jitter. These structures have 
the disadvantage of large positive hold times, which complicate 
timing verification. The pulse generators can also consume 
considerable energy, as pulses must be generated locally to avoid 
pulse distortion. Nonetheless, because of their performance ad- 
vantages, these pulsed latch structures have been used in several 
commercial high-performance microprocessors [5], [6]. Apart 
from raw performance and energy consumption, other attributes 
are used to evaluate flip-flop structures including robustness, 
compatibility with high-performance logic families, and ability 
to embed logic into the flip-flop. 

In this work we introduce a new flip-flop structure, 
the double-pulsed set-conditional-reset flip-flop (DPSCRFF), 
which is up to three times faster than the fastest previously 
known flip-flops while consuming approximately the same energy as the 
lowest-power flip-flops. The DPSCRFF is a single-ended static 
flip-flop design with a single logic stage which can include ar- 
bitrary logic functionality. The DPSCRFF is compatible with 
static or dynamic logic, and in particular can directly drive fol- 
lowing dynamic logic. 

II. DPSCRFF Design 

Fig. 1 shows the design of the DPSCRFF. The DPSCRFF 
is composed of two pieces: a static set-reset latch and a pulse 
generator. Fig. 2 shows the operation of the static latch. The 
latch requires two clock pulses, p1 and p2, which are generated 
from the active clock edge. The first pulse presets the output 
node high using the p-type pull-up. The second pulse conditionally 
resets the output node, based on the value of the data input. 
The precharge causes a glitch at the output node whenever the 
output is supposed to remain low, which is discussed further below. 
An additional inverter can be added to the output stage to 
isolate the storage node from the output load. 

(a) DPSCRFF using conventional pulse generation 
(b) DPSCRFF using novel pulse generation 

Fig. 1. DPSCRFF 

Fig. 2. DPSCRFF operation 
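The preset and conditional-reset behavior described above can be sketched as a small behavioral model. This is only an illustration at the logic level, not a circuit simulation; the function names `dpscrff_cycle` and `glitches` are hypothetical and do not come from the original design.

```python
def dpscrff_cycle(d):
    """Model the storage node Qb across one active clock edge.

    Returns (qb_after_p1, qb_after_p2).
    Assumption: d is stable during p2 and is given as 0 or 1.
    """
    qb_after_p1 = 1              # p1: unconditional preset through the p-type pull-up
    qb_after_p2 = 0 if d else 1  # p2: reset only when the data input is high
    return qb_after_p1, qb_after_p2


def glitches(qb_prev, d):
    """True when Qb should stay low but is pulsed high by the preset."""
    qb_p1, qb_p2 = dpscrff_cycle(d)
    return qb_prev == 0 and qb_p2 == 0 and qb_p1 == 1
```

In this sketch, `glitches(0, 1)` is true: the output was low, the input is still high, and the preset briefly forces the node high before p2 resets it.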

The path from input to output is only a single stage of logic, 
which is the key to the design's high performance. In addition, 
arbitrary logic can be embedded into the pull-down stack, sim- 
ilar to a domino pull-down tree, as shown in Fig. 3. Another 
advantage is that the data input sees only a single transistor load 
which reduces required input drive and energy consumption. 
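As a rough logic-level illustration of embedding a function in the pull-down stack, the sketch below models a 2-input mux whose result conditionally resets the storage node. The exact transistor topology of Fig. 3 is not reproduced here; the signal names `sel`, `a`, and `b` and both function names are assumptions made for the example.

```python
def mux2_pulldown(sel, a, b):
    """True when the (assumed) mux pull-down stack conducts during p2."""
    return bool((sel and a) or ((not sel) and b))


def qb_after_edge(sel, a, b):
    """Qb is preset high by p1, then reset low only if the stack conducts."""
    return 0 if mux2_pulldown(sel, a, b) else 1
```

Under these assumptions Qb ends the cycle as the complement of the selected input, which is the same relationship the plain DPSCRFF has between its data input and Qb.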
Fig. 4. DPSCRFF shift register 



Fig. 3. A 2-input mux embedded in the DPSCRFF 



III. Double Pulse Generator 

The two pulses are generated by a local pulse generator to 
avoid pulse distortions from additional pulse buffers and wiring. 
The pulse generator can be shared by a few neighboring flip- 
flops to reduce pulse generator area and energy overheads. The 
width of the pulses is controlled by the inverter delay chain. The 
inverters in the chain can be skewed to control the lengths of p1 
and p2. The width of p2 determines the transparency window 
of the latch. To reduce setup and hold time requirements, p2 
should be made as small as possible. However, if p2 is too short, 
the circuit will not function. Detailed simulation at all process 
corners and careful control of clock pulse loading will ensure 
proper functionality. 

The conventional way to generate a pair of pulses uses an 
inverter delay chain as in Fig. 1(a). This design has a large 
number of intermediate nodes and thus dissipates a significant 
amount of energy. Our alternative design reduces the number 
of intermediate nodes by using an inverter delay chain both to 
generate p2 and to turn off p1. As shown in Fig. 2, intermediate 
node X is precharged high during the low phase of the global 
clock. When the clock rises, p1 falls. After some delay, p2 
rises. This causes node X to discharge, causing p1 to rise. After 
some delay, p2 falls. Note that node X floats in the low state 
until the global clock goes low. This can be a concern if the 
global clock is held high for a long time. In this design, p1 and 
p2 overlap by some amount. This causes some overlap current 
in the latch when the data input is high. However, the extra 
energy dissipation caused by the overlap current is modest 
and is much less than the energy saved by using this pulse 
generator design. It is possible to design the pulse generator to 
separate the pulses, but the energy cost of separating the pulses 
with a longer inverter chain is larger than that of the overlap 
current. 
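The event ordering of the alternative pulse generator can be sketched as a simple timing calculation. All delay parameters below are hypothetical placeholders rather than values from the design; the point is only that, because p1 is turned off by the rise of p2, the two pulses overlap by the delay through node X.

```python
from dataclasses import dataclass


@dataclass
class PulseTimes:
    p1_fall: float  # p1 becomes active (low)
    p2_rise: float  # p2 becomes active (high)
    p1_rise: float  # p1 turns off after node X discharges
    p2_fall: float  # p2 turns off

    @property
    def overlap(self):
        """Time during which p1 (active low) and p2 (active high) are both active."""
        start = max(self.p1_fall, self.p2_rise)
        end = min(self.p1_rise, self.p2_fall)
        return max(0, end - start)


def pulse_times(t_clk_rise, d_p1_fall, d_p2_rise, d_x_to_p1, p2_width):
    """Derive pulse edges from the rising clock edge and assumed delays."""
    p1_fall = t_clk_rise + d_p1_fall  # clock rises, p1 falls
    p2_rise = t_clk_rise + d_p2_rise  # after the delay chain, p2 rises
    p1_rise = p2_rise + d_x_to_p1     # p2 discharges X, which raises p1
    p2_fall = p2_rise + p2_width      # after some further delay, p2 falls
    return PulseTimes(p1_fall, p2_rise, p1_rise, p2_fall)
```

For example, with illustrative delays of 20, 80, 30, and 60 time units, `pulse_times(0, 20, 80, 30, 60).overlap` evaluates to 30, the assumed X-to-p1 delay.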

IV. DPSCRFF Timing Analysis 

The DPSCRFF has asymmetric timing properties. A low input 
propagates through the flip-flop in negative time as the output 
is preset at the start of p1. A low input must be set up by the 
start of the second sampling pulse p2, and the hold time lasts 
for the duration of p2. A high input, however, can arrive later 
during the transparency period p2. The hold time of the high 
input just has to be large enough to switch the state of the static 
latch. The high value will still be correctly registered at the end 
of p2 even if the high value drops low again during p2. 
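The asymmetric setup and hold requirements can be expressed as two small checks. This is a simplified sketch: `t_flip`, the minimum time a high input must be present during p2 to switch the latch, is a hypothetical parameter, and the function names are illustrative.

```python
def captures_high(d_high_start, d_high_end, p2_start, p2_end, t_flip):
    """True if the D-high interval overlaps p2 for at least t_flip."""
    overlap = min(d_high_end, p2_end) - max(d_high_start, p2_start)
    return overlap >= t_flip


def low_is_safe(d_low_from, d_low_until, p2_start, p2_end):
    """True if a low input is set up by the start of p2 and held through p2."""
    return d_low_from <= p2_start and d_low_until >= p2_end
```

With p2 spanning 80 to 140 and `t_flip` of 20, a high input present only from 90 to 120 is still captured, whereas a low input arriving at 90 violates the setup requirement.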

The asymmetric timing properties can be exploited in skewed 
static logic and dynamic domino logic styles. In particular, 
transistors on the fast edge path of a DPSCRFF output can be sized 
down. This reduces the capacitive load on signals, reducing 
power and improving the performance of the slow edge paths. A 
skewed static logic cell library was used in the design of the 
Z900 microprocessor to achieve full custom-like circuits [7]. 

Fig. 4 shows two DPSCRFFs connected as a shift register to 
illustrate hold time violations. Consider the state just before a 
clock edge, when the first DPSCRFF has a reset value on Qb. 
This value will be propagating through the combinational logic to 
the input of the second DPSCRFF. At the clock edge, pulse p1 is 
generated and the first DPSCRFF will begin propagating a preset 
value from its output before the second DPSCRFF has sampled 
its input using pulse p2, potentially causing a hold time 
violation. A conservative approach would be to require sufficient 
logic levels between DPSCRFFs such that the preset value 
initiated by pulse p1 could not arrive at the second DPSCRFF until 
the end of pulse p2. A more aggressive approach takes advantage 
of the asymmetry of the sampling input. If there is an odd 
number of inverting logic levels between the two DPSCRFFs, 
then the high-going preset value from the first DPSCRFF 
eventually propagates into a low-going value at the input to the 
second DPSCRFF. This low-going value will not cause a hold-time 
violation even if it arrives before the end of p2, provided that the 
previous input was high long enough to flip the latch state. In 
our technology, we found that five levels of FO4-loaded inverters 
between DPSCRFFs were sufficient to ensure no hold-time 
violations across PVT corners with ample margin (three levels 
just failed in one process corner). 
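The conservative and aggressive hold-time rules can be sketched as follows. This is a simplified model under stated assumptions: `arrival` is the time at which the preset value launched by p1 reaches the second flip-flop's input, `t_flip` is a hypothetical minimum time the previous high input must be present during p2 to flip the latch, and an even number of inverting levels is treated pessimistically as unsafe.

```python
def hold_ok_conservative(arrival, p2_end):
    """Safe if the preset value cannot arrive before p2 ends."""
    return arrival >= p2_end


def hold_ok_aggressive(arrival, p2_start, p2_end, inverting_levels, t_flip):
    """Also accept early arrival when it shows up as a low-going edge."""
    if arrival >= p2_end:
        return True
    if inverting_levels % 2 == 0:
        # Preset arrives as a high-going edge inside p2; treated as unsafe here.
        return False
    # Odd inversion: low-going edge is harmless if the earlier high input
    # was present long enough during p2 to flip the latch.
    return arrival - p2_start >= t_flip
```

The last condition is one plausible reading of why a short path can still fail: if the low-going edge arrives too soon after p2 starts, the previous high value has not yet been registered.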

This DPSCRFF does not allow arbitrary time borrowing 
across the transparency window as with other pulsed latches. 
Time borrowing is only possible for late arriving high inputs, 
e.g., from a preceding domino logic stage or a preceding skewed 
static logic stage. 

The output of the DPSCRFF has a glitch in the case where 
the output Qb is to stay low, i.e., the input remained high. The 
precharge pulse p1 first forces the output high before the data 
input resets the output. This glitch can cause additional power 
dissipation in downstream logic. There is a tradeoff between the 
additional power dissipation caused by the glitch, and the pos- 
sible power savings the glitch provides by enabling the use of 
highly skewed static logic. This is similar to the energy trade- 
offs of precharged domino logic versus static logic. 
Fig. 5. DPSCRFF with domino logic 



V. Interfacing to Domino Logic 

Fig. 5 shows a DPSCRFF interfacing to domino logic at its 
input and output. By adding an output inverter, the DPSCRFF 
can be treated as another domino logic stage. The monotonic 
rising output of a preceding domino gate can arrive late into the 
p2 sampling period of the DPSCRFF, reducing effective setup 
time. The pulsed preset value on the output of a DPSCRFF 
also simplifies driving a following domino gate. The following 
domino gate does not have to wait until the worst case Clk-to-Q 
of the flip-flop to enter evaluate, as the DPSCRFF will first set 
its output inverter low then give a monotonic rising output in the 
same way as a domino gate. However, note that the clock signal 
input to the domino logic is a delayed (or inverted) version of 
the global clock used by the pulse generator. 

VI. Evaluation Methodology 

The DPSCRFF and other previously published designs 
(Fig. 6) [8] were simulated using HSpice from schematic 
netlists annotated with accurate source/drain parasitic diode 
parameters in a TSMC 0.25 µm process. Fig. 7 shows the 
testbench used for the evaluation. The testbench is based on that 
in [8]. However, we chose more balanced 2/1 inverters instead 
of minimum sized inverters in the data and clock buffers. As in 
[8], [3], we subtracted out the energy dissipated in charging and 
discharging the output load capacitors. In addition, as in [3], we 
also subtracted out the energy dissipated in the input buffers. 

The relative ranking of flip-flops depends on the loading 
conditions assumed [9]. For this evaluation, we chose a load of 
7.2 fF, which corresponds to four minimum sized inverters in 
this technology. This represents a typical light load in a datapath 
structure [8]. To drive higher loads, it is likely that additional 
levels of output buffering should be used [9]. 

Each pulse generator for the DPSCRFF and the SSASPL was 
connected to four flip-flops, and the energy cost of pulse 
generation was amortized across them. 
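The amortization amounts to charging each flip-flop one quarter of the pulse generator energy. A minimal sketch follows; the function name and the numbers in the usage note are illustrative and are not measurements from this evaluation.

```python
def energy_per_flop(e_latch, e_pulse_gen, flops_sharing=4):
    """Per-flip-flop energy with the pulse generator cost shared equally.

    e_latch: energy of one latch per cycle (any consistent unit, e.g. fJ)
    e_pulse_gen: energy of one pulse generator per cycle (same unit)
    flops_sharing: number of flip-flops driven by that pulse generator
    """
    if flops_sharing < 1:
        raise ValueError("at least one flip-flop must share the pulse generator")
    return e_latch + e_pulse_gen / flops_sharing
```

For instance, a hypothetical latch energy of 100 and pulse generator energy of 40 give 110 per flip-flop when four flip-flops share the generator.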

The transistor sizes in the designs were each optimized for 
several design points. This optimization was performed using 
data inputs that were stable well before and after the arrival of 
the clock. The clock was ungated and the data alternated on 
every cycle. Clk-to-Q delay and energy were measured. 
Afterward, the minimum D-to-Q delays were found by optimizing 
the data input arrival times. The minimum D-to-Q delay is the 
best metric for measuring the performance of timing elements, as 
it takes into account the relationship between input arrival time 
and Clk-to-Q delay [3]. 

(d) Hybrid Latch Flip-Flop (HLFF) 

Fig. 6. Flip-flops for comparison 

Fig. 7. Testbench setup 

VII. Results 

Fig. 8 shows the results. The rising and falling delays for 
the DPSCRFF have been separated out since they differ sig- 
nificantly. The rising delays are negative since the output 
precharges before the input is required to arrive. The flip-flops 
were optimized for the worst-case positive delay, which in some 
cases increases the negative delays. 

As described above, the negative delay can be used to im- 
prove performance or to lower power if skewed logic circuits are 
used. As can be seen, the fastest DPSCRFF at 54 ps is 
significantly faster than the next fastest flops (HLFF and SSASPL) at 
roughly 150 ps. The lowest-power DPSCRFF at 141 fJ is 
comparable to the lowest-power flop (PPCFF) at 130 fJ. However, the 
DPSCRFF has a propagation delay of only 167 ps compared to 342 ps. 
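The ratios implied by these figures can be checked with a few lines of arithmetic. The inputs are the values quoted in the text (with 150 ps being the stated approximate figure); the variable names are chosen for this sketch.

```python
# Delay and energy figures quoted in the text.
fastest_dpscrff_ps = 54
next_fastest_ps = 150        # HLFF and SSASPL, "roughly 150 ps"
low_power_dpscrff_fj = 141
low_power_ppcff_fj = 130
low_power_dpscrff_ps = 167
low_power_ppcff_ps = 342

speedup_fastest = next_fastest_ps / fastest_dpscrff_ps        # about 2.78
speedup_low_power = low_power_ppcff_ps / low_power_dpscrff_ps  # about 2.05
energy_ratio = low_power_dpscrff_fj / low_power_ppcff_fj       # about 1.08
```

The first ratio, about 2.8, is consistent with the "up to three times faster" claim, and the lowest-power DPSCRFF is about twice as fast as the PPCFF for roughly 8% more energy.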

Fig. 9 shows how the energy dissipation varies with different 
clock and data input patterns for the different flip-flops. Note 
that the flip-flops shown in this figure have widely varying 
propagation delays, as shown by the labels on the axes. When the data 
is held low while the clock continues to run, the energy 
dissipation of the DPSCRFF is reduced. However, if the clock is 
running and the data is held high, the DPSCRFF actually 
dissipates more power than for the full activity waveforms because 
of its output glitches. When the clock is held stable, no internal 
nodes change state and only the single data input gate toggles. 
The DPSCRFF therefore has low energy when the local clock 
is gated. 

Fig. 8. Energy versus delay 



VIII. Conclusion 



The DPSCRFF has the smallest D-to-Q delay of published 
flip-flop designs, with comparable energy to the lowest-power 
flip-flop designs. When the clock is gated, the DPSCRFF has 
the lowest possible data input loading (a single transistor gate). 
The asymmetric propagation delay enables the use of highly 
skewed logic to reduce cycle time and energy. The glitching 
present at the output may cause additional energy dissipation in 
downstream logic, depending on signal statistics. 



